Abstract. Streaming heterogeneous information is ubiquitous in the era of Big
Data and provides versatile perspectives for a more comprehensive understanding
of the behavior of an underlying system or process. Human analysis of these
volumes is infeasible, leading to unprecedented demand for mathematical tools
that effectively parse and distill such data. However, the complicated nature
of streaming heterogeneous data prevents conventional multivariate data
analysis methods from being applied directly. In this paper, we propose a novel
framework together with an online algorithm, denoted as LSTH, for latent space
tracking from heterogeneous data. Our method leverages the advantages of
dimension reduction, correlation analysis and sparse learning to better reveal
the latent relations among heterogeneous information and adapt to slow
variations in streaming data. We applied our method to both synthetic and real
data, and it achieves results competitive with or superior to the state of the
art in detecting several different types of anomalies.


1  Introduction 

In the era of Big Data, heterogeneity of the information generated from a
single yet complex underlying system or process has become ubiquitous. Examples
of such heterogeneous data include video and audio from a sensor network, and
acoustic and articulatory signals during speech. Such heterogeneous data
provide complementary or augmented depictions of the system from different
perspectives, allowing a more comprehensive understanding of the system than
homogeneous data. Despite their high dimensionality and heterogeneity, these
data often exhibit a low-dimensional nature and can be characterized by a
(low-dimensional) latent space. Correctly identifying the latent space benefits
classical machine learning tasks (e.g., classification [6]) as well as more
novel applications (e.g., anomaly detection). However, learning from
heterogeneous data is highly nontrivial. The requirement of operating in real
time imposes further challenges and prevents straightforward extensions of
existing methods.

Principal Component Analysis (PCA) [7] is arguably the best-known method for
extracting a low-dimensional latent space. A common assumption in applying PCA
is that most data lie near the low-dimensional space. Anomalies are assumed to
deviate significantly from the space, so that simple statistics suffice to
identify them. Inspired by this assumption, online PCA [12] techniques have
been developed to conduct anomaly detection on data streams. Representative
online PCA algorithms include [4] as well as its extension [17] under a
union-of-subspaces assumption. However, PCA-based methods do not model the
relations between heterogeneous data sources. Therefore, PCA cannot identify
anomalies that correspond to violations of these relations. In contrast,
Canonical Correlation Analysis (CCA) [5] is a classical method for analyzing
the relation between multiple data sources, and online CCA through stochastic
gradient on the generalized Stiefel manifold has been applied to anomaly
detection on time series [19]. However, it still does not fully consider the
heterogeneous nature of the data.

Recently, learning from heterogeneous data has attracted much attention in the
machine learning community, particularly in transfer learning, multi-task
learning and multi-view learning. Transfer learning utilizes auxiliary source
domain data to learn a better model in a target domain, where the two domains
are often heterogeneous [13]. Multi-task learning leverages the relation
between multiple tasks, each of which may work on a different, heterogeneous
data domain [6]. Multi-view learning leverages multiple views of the same
instances to build better models [18]. Many of these works assume a common
low-dimensional latent space and learn a mapping from each data source or view
to the latent space in a supervised fashion. However, adapting these methods to
an online and unsupervised setting (e.g., an anomaly detection task) is not
straightforward.

In this paper, we tackle the problem of online learning from heterogeneous data
via latent space tracking. Specifically, we propose a framework to track the
low-dimensional latent structures of heterogeneous data and learn their
inherent relations. Our formulation incorporates the key insights underlying
PCA, CCA, and sparse learning to enable dimension reduction together with
feature selection for anomaly detection from heterogeneous data. We develop an
efficient online algorithm that conducts Latent Space Tracking from
Heterogeneous data, denoted as LSTH. Based on the learned latent space, we
further design an anomaly detection method that reports data points lying
significantly outside the latent space. We test LSTH on both synthetic and real
datasets. Experimental results demonstrate that LSTH is effective in revealing
relations among heterogeneous data for anomaly detection.

The paper is organized as follows. Section 2 formulates the latent space
tracking problem. Section 3 presents the tracking algorithm. Section 4 designs
an anomaly detection method as an application of the learned latent space.
Experimental results and conclusions are given in Sections 5 and 6,
respectively.

2  Problem  Formulation

Throughout this paper, vectors are represented by lower-case letters (e.g.,
$x$), and matrices are represented by upper-case letters (e.g., $U$). By
default, all vectors are column vectors, while row vectors are represented with
a transpose superscript (e.g., $x^T$). We use subscripts $i$ and $j$ to index
an element in a matrix (e.g., $V_{ij}$) and subscript $t$ to index a data point
at timestamp $t$ (e.g., $x_t$) in a data stream. The estimate of a variable is
represented by a hat over the variable (e.g., $\hat{U}$ represents the estimate
of $U$).

We assume $x_t \in \mathbb{R}^{D_x}$ and $y_t \in \mathbb{R}^{D_y}$ are
high-dimensional heterogeneous data samples from the same system at timestamp
$t$, where $D_x$ and $D_y$ are the numbers of features in $x_t$ and $y_t$,
respectively. The heterogeneity of particular interest in this paper is that
the features of $x_t$ are correlated, whereas only very few features in $y_t$
describe the states of the system. Heterogeneous data in many real-life
applications exhibit this property. For example, during speech, $x_t$ can be
data recorded by articulatory sensors, which are highly correlated [3] due to
connected muscles. In contrast, $y_t$ can be Mel-frequency cepstrum
coefficients (MFCC). Obtained by appending higher-order derivatives of the
acoustic signal, MFCC contain much redundancy and often need a feature
selection [11] step before further processing. In a stock market, $x_t$ could
be the prices of multiple correlated stocks, and $y_t$ massive news about the
market [14].

In order to learn the underlying structures and relations among $x_t$ and
$y_t$, we monitor the joint probability density $p(x_t, y_t)$ at each timestamp
$t$:

$$p(x_t, y_t) = p(y_t \mid x_t)\, p(x_t). \tag{1}$$

However, since both $x_t$ and $y_t$ are of high dimensionality, online density
estimation for $p(y_t \mid x_t)$ or $p(x_t)$ is prohibitively difficult.
Therefore, we assume there is a $d$-dimensional latent space
($d \ll D_x, D_y$) underlying the data, into which $x_t$ and $y_t$ can be
transformed via two linear projectors $U \in \mathbb{R}^{D_x \times d}$ and
$V \in \mathbb{R}^{D_y \times d}$. Their projections are denoted as $U^T x_t$
and $V^T y_t$, respectively, which can be considered as realizations of a
common latent variable that determines the states of the underlying system. $U$
and $V$ will exhibit different structures. Specifically, while $U$ may span a
low-rank subspace as in PCA, $V$ may model a latent space impacting only a
subset of the features in $y_t$.


3  Proposed  Approach 

We constrain $U$ to be orthonormal (i.e., $U^T U = I$, where $I$ is the
identity matrix) to preserve the magnitude of $x_t$. Thus, the reconstruction
error of $x_t$ is $\|x_t - UU^T x_t\|^2$. In this case, we measure the
probability distribution of $x_t$ by the reconstruction error [17]:

$$p(x_t) \propto \exp\left(-\|x_t - UU^T x_t\|^2 / \sigma_x^2\right), \tag{2}$$

where $\sigma_x^2$ is the variance of the reconstruction error in each
dimension. Since the projections of $x_t$ and $y_t$ are considered as
realizations of a common latent variable, they are expected to be close. Hence,
we measure $p(y_t \mid x_t)$ by the distance of the projections in the latent
space:

$$p(y_t \mid x_t) \propto \exp\left(-\|V^T y_t - U^T x_t\|^2 / \sigma_y^2\right), \tag{3}$$

where $\sigma_y^2$ is the variance of the difference between $x_t$ and $y_t$ in
the latent space.

By substituting Equations (2) and (3) into (1) and taking the logarithm, the
log-likelihood can be represented as

$$\log p(x_t, y_t) \propto -\frac{\|V^T y_t - U^T x_t\|^2}{\sigma_y^2} - \frac{\|(I - UU^T) x_t\|^2}{\sigma_x^2}.$$

In addition, we constrain $V$ to exhibit a "group sparse" structure so that
applying $V$ performs feature selection on $y_t$ to identify the most
informative features. We use the mixed norm
$\|V\|_{1,2} = \sum_{i=1}^{D_y} \|v_i^T\|_2$ to introduce sparsity into $V$,
where $v_i^T$ is the $i$-th row of $V$.

To enable tracking in a slowly evolving environment, we apply an exponentially
decaying window to downweight the historical samples. In addition, we define
$\sigma = \sigma_y^2 / \sigma_x^2$, and denote the estimates of $U$ and $V$ at
timestamp $t$ as $U_t$ and $V_t$, respectively. Then we formulate the following
optimization problem to find the projectors $U$ and $V$ at timestamp $t$:

$$(U_t, V_t) = \arg\min_{U^T U = I,\, V} F(U, V; t, \alpha, \sigma, \lambda)
= \arg\min_{U^T U = I,\, V} \sum_{k=0}^{t-1} \frac{\alpha^k}{2} \left( \|U^T x_{t-k} - V^T y_{t-k}\|^2 + \sigma \|(I - UU^T) x_{t-k}\|^2 \right) + \lambda \|V\|_{1,2}, \tag{4}$$

where $\alpha \in (0, 1]$ is a forgetting factor over historical samples that
implements the decaying window, $\sigma$ balances the projection residual
against the discrepancy in the latent space, and $\lambda$ is the
regularization parameter for sparsity. Note that the data stream starts from
$t = 1$.
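
As an illustration of how the objective in Equation (4) is assembled, the
following is a minimal NumPy sketch that evaluates it for given projectors. The
function names `mixed_norm` and `lsth_objective`, and the convention of storing
the samples $x_1, \ldots, x_t$ as matrix columns, are assumptions made for this
example and do not come from the paper.

```python
import numpy as np

def mixed_norm(V):
    """||V||_{1,2}: sum of the Euclidean norms of the rows of V."""
    return np.linalg.norm(V, axis=1).sum()

def lsth_objective(U, V, X, Y, alpha, sigma, lam):
    """Evaluate the objective of Equation (4) at timestamp t = X.shape[1].

    X (D_x x t) and Y (D_y x t) hold x_1..x_t and y_1..y_t as columns,
    so the 0-based column j has age k = t - 1 - j.
    """
    t = X.shape[1]
    weights = alpha ** np.arange(t - 1, -1, -1)   # alpha^k, newest sample has k = 0
    latent_gap = U.T @ X - V.T @ Y                # discrepancy in the latent space
    residual = X - U @ (U.T @ X)                  # (I - U U^T) X
    per_sample = (latent_gap ** 2).sum(axis=0) + sigma * (residual ** 2).sum(axis=0)
    return 0.5 * weights @ per_sample + lam * mixed_norm(V)
```

When both views are explained perfectly by the latent space, only the sparsity
penalty $\lambda \|V\|_{1,2}$ remains, which is a convenient sanity check for
an implementation.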

In the above $F(U, V; t, \alpha, \sigma, \lambda)$, the first term measures the
discrepancy between the two data sources in the latent space. It has the flavor
of CCA, which maximizes the correlation of two projections. As in PCA, the
second term imposes a low-dimensional structure on $x_t$. It is important to
highlight the $\|V\|_{1,2}$ term here: $\|v_i^T\|_2$ indicates the significance
of the $i$-th feature in $y_t$. In addition, $\|V\|_{1,2}$ is invariant under
right-multiplication of $V$ by a unitary matrix. Therefore, the cost of (4)
depends on the subspaces spanned by $U_t$ and $V_t$ rather than on the
particular bases chosen.
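
The invariance property can be checked numerically. The short sketch below
(variable names and sizes are illustrative assumptions) draws a random
orthogonal matrix $Q$ and confirms that the row norms of $V$, and hence the
mixed norm and the set of selected features, are unchanged by $V \mapsto VQ$.

```python
import numpy as np

rng = np.random.default_rng(0)
D_y, d = 8, 3
V = rng.standard_normal((D_y, d))
V[5:] = 0.0                                        # rows of unselected features are exactly zero

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random d x d orthogonal matrix

def mixed_norm(V):
    """||V||_{1,2}: sum of the Euclidean norms of the rows of V."""
    return np.linalg.norm(V, axis=1).sum()

row_norms_before = np.linalg.norm(V, axis=1)
row_norms_after = np.linalg.norm(V @ Q, axis=1)    # rotating the basis keeps every row norm
```

Because each row is rotated individually, a zero row stays zero, so the feature
selection pattern does not depend on the basis chosen for the latent space.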


3.1  A  Batch  Algorithm 

We  first  present  a  batch  algorithm,  denoted  as  bLSTH,  to  solve  U  and  V  for 
simplicity.  The  bLSTH  algorithm  will  be  further  modified  into  an  online  version 
in  Section  3.2. 

Algorithm 1. The Batch Algorithm bLSTH
Input: samples $X \in \mathbb{R}^{D_x \times L}$, $Y \in \mathbb{R}^{D_y \times L}$
Parameters: $\lambda$, $\sigma$, latent dimension $d$
Output: $U$ and $V$
  $i \leftarrow 0$, $U[0] \leftarrow$ the first $d$ principal components of $X$
  repeat
    $i \leftarrow i + 1$
    $Z \leftarrow U[i-1]^T X$
    $V[i] \leftarrow \arg\min_{V} \frac{1}{2}\|V^T Y - Z\|_F^2 + \lambda \|V\|_{1,2}$    (6)
    $W \leftarrow V[i]^T Y$
    $U[i] \leftarrow \arg\min_{U^T U = I} \frac{1}{2}\left(\|U^T X - W\|_F^2 + \sigma \|(I - UU^T) X\|_F^2\right)$    (7)
  until $U[i]$, $V[i]$ converge or $i$ is large enough
  $U \leftarrow U[i]$, $V \leftarrow V[i]$
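
Step (6) of Algorithm 1 is a group-lasso problem. The paper does not state
which solver is used for it; as one possible choice, the sketch below applies
proximal gradient descent (ISTA) with row-wise soft thresholding. The function
names, the fixed iteration count and the step-size rule are assumptions of this
example.

```python
import numpy as np

def row_soft_threshold(V, tau):
    """Row-wise (group) soft thresholding: shrink each row norm by tau, zeroing short rows."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V

def v_step_objective(V, Y, Z, lam):
    """Objective of step (6): 0.5 * ||V^T Y - Z||_F^2 + lam * ||V||_{1,2}."""
    return 0.5 * np.linalg.norm(V.T @ Y - Z, "fro") ** 2 + lam * np.linalg.norm(V, axis=1).sum()

def solve_v_step(Y, Z, lam, n_iter=500):
    """Proximal gradient descent for step (6); a swapped-in solver, not the authors' method."""
    D_y, d = Y.shape[0], Z.shape[0]
    C_y = Y @ Y.T
    YZt = Y @ Z.T
    # Step size 1/L, where L is the largest eigenvalue of Y Y^T (Lipschitz constant of the gradient).
    step = 1.0 / max(np.linalg.eigvalsh(C_y)[-1], 1e-12)
    V = np.zeros((D_y, d))
    for _ in range(n_iter):
        grad = C_y @ V - YZt                      # gradient of the smooth quadratic term
        V = row_soft_threshold(V - step * grad, step * lam)
    return V
```

Rows of $V$ whose features carry no information about $Z$ are driven exactly
to zero by the thresholding step, which is the feature selection effect the
mixed norm is meant to produce.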


In bLSTH, $L$ buffered samples $X = [x_{-L+1}, \ldots, x_0]$,
$Y = [y_{-L+1}, \ldots, y_0]$ are used to solve the following optimization
problem:

$$(U, V) = \arg\min_{U^T U = I,\, V} \frac{1}{2}\left(\|U^T X - V^T Y\|_F^2 + \sigma \|(I - UU^T) X\|_F^2\right) + \lambda \|V\|_{1,2}.$$


We use an alternating method to solve for $U$ and $V$, as presented in
Algorithm 1. The optimization problem in Equation (6) of Algorithm 1 is a
well-studied convex optimization problem. We now focus on the optimization
problem in Equation (7). The objective can be reformulated as:

$$f(U; \sigma) \triangleq \frac{1}{2}\left(\|U^T X - W\|_F^2 + \sigma \|(I - UU^T) X\|_F^2\right)
= \frac{1}{2}(1 - \sigma)\, \mathrm{tr}\{U^T X X^T U\} - \mathrm{tr}\{(X W^T) U^T\} + \text{const}, \tag{5}$$

where $U^T U = I$, $W = V^T Y$, and the constant collects the terms that do not
depend on $U$. This orthonormality-constrained problem is non-convex. However,
we are able to find a local minimum within a few iterations, and our
experiments show that even a local minimum gives good results. Following the
idea in [8], we use a majorization-minimization scheme. The basic idea is to
construct a non-increasing sequence $f(U[1]), \ldots, f(U[k]), \ldots$ that
converges to a local minimum of $f(U)$. Specifically, suppose we are at
$U[k]$; we construct a surrogate function $g_k(U)$ that satisfies

$$f(U) \le g_k(U) \quad \text{and} \quad f(U[k]) = g_k(U[k]). \tag{8}$$

That is, $g_k(U)$ is an upper bound of $f(U)$, with equality when
$U = U[k]$. We assign the global minimizer of $g_k(U)$ to $U[k+1]$, so that
$f(U[k+1]) \le g_k(U[k+1]) \le g_k(U[k]) = f(U[k])$. The sequence
$f(U[1]), \ldots, f(U[k]), \ldots$ is thus guaranteed to be non-increasing, due
to the properties of $g_k(U)$ in Equation (8) and the notion of a global
minimizer. In practice, a surrogate function should be constructed such that
its global minimizer is easily obtained. The following two lemmas suggest one
form of such a $g_k(U)$ and its global minimizer.

Lemma 1. For any given orthonormal matrix $U[k] \in \mathbb{R}^{D_x \times d}$,
the following $g_k(U; \sigma)$, defined on the set of orthonormal matrices
$U \in \mathbb{R}^{D_x \times d}$,

$$g_k(U; \sigma) = \mathrm{tr}\left\{\left[(1 - \sigma)(X X^T - aI)\, U[k] - (X W^T)\right]^T U\right\} + c$$

is a surrogate function for the $f(U; \sigma)$ in Equation (5), where $c$ is
some constant independent of $U$, and the scalar $a$ is chosen as

$$a = \begin{cases} \lambda^* & \sigma < 1 \\ 0 & \sigma \ge 1 \end{cases},$$

where $\lambda^*$ is the maximum eigenvalue of $X X^T$.

Proof. The proof leverages the Rayleigh quotient inequality and is omitted for
conciseness.
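
Although the proof is omitted, one way to see why this choice of $a$ yields a
valid surrogate is sketched below. This is our own reading based on
semidefiniteness of $X X^T - \lambda^* I$ and $X X^T$, and not necessarily the
authors' argument; the shorthand $C$, $G$ and $U_k$ is introduced only for this
sketch.

```latex
% Shorthand (assumed for this sketch): C = X X^T, G = X W^T, U_k = U[k].
% Up to a constant, Equation (5) reads
%   f(U) = \tfrac{1}{2}(1-\sigma)\,\mathrm{tr}\{U^T C U\} - \mathrm{tr}\{G^T U\}.
\begin{align*}
\text{Case } \sigma < 1:\quad
  & \mathrm{tr}\{U^T C U\} = \mathrm{tr}\{U^T (C - \lambda^* I) U\} + \lambda^* d
    \quad (\text{since } U^T U = I), \\
  & C - \lambda^* I \preceq 0 \;\Rightarrow\;
    \mathrm{tr}\{(U - U_k)^T (C - \lambda^* I)(U - U_k)\} \le 0 \\
  & \Rightarrow\; \mathrm{tr}\{U^T (C - \lambda^* I) U\}
    \le 2\,\mathrm{tr}\{[(C - \lambda^* I) U_k]^T U\}
      - \mathrm{tr}\{U_k^T (C - \lambda^* I) U_k\}. \\
\text{Case } \sigma > 1:\quad
  & C \succeq 0 \;\Rightarrow\;
    \mathrm{tr}\{U^T C U\} \ge 2\,\mathrm{tr}\{[C U_k]^T U\} - \mathrm{tr}\{U_k^T C U_k\}. \\
\text{Both cases:}\quad
  & \text{multiplying by } \tfrac{1}{2}(1-\sigma)
    \text{ (positive, respectively negative) gives} \\
  & f(U) \le \mathrm{tr}\{[(1-\sigma)(C - aI) U_k - G]^T U\} + c,
    \quad \text{with equality at } U = U_k.
\end{align*}
```

For $\sigma = 1$ the quadratic term vanishes, so the value of $a$ is
immaterial.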

Lemma 2. [10] The global minimizer of

$$\min_{U^T U = I} -\mathrm{tr}\{A^T U\}$$

is $P Q^T$, where $P \Sigma Q^T = A$ is the Singular Value Decomposition (SVD)
of $A$.

Using Lemma 2, the global minimizer of the surrogate function $g_k(U; \sigma)$
has the closed form $\arg\min_{U^T U = I} g_k(U; \sigma) = P Q^T$, where
$P \Sigma Q^T$ is the SVD of $X W^T - (1 - \sigma)(X X^T - aI)\, U[k]$. Thus,
by applying Lemmas 1 and 2, the problem in Equation (7) can be solved via the
iterative majorization-minimization process presented in Algorithm 2, where
$G = X W^T = X Y^T V$ and $C_x = X X^T$. A special case is $\sigma = 1$, in
which the minimizer of $f(U; \sigma)$ is given directly in closed form by
Lemma 2.

3.2  An  Online  Algorithm 

Here we derive the online algorithm LSTH from bLSTH. We use the solution
$(U, V)$ obtained by bLSTH on the samples $X = [x_{-L+1}, \ldots, x_0]$,
$Y = [y_{-L+1}, \ldots, y_0]$ as the initialization $(U_0, V_0)$ for the online
updates, assuming the online process starts from timestamp $t = 1$. We also use
an alternating method to track $(U_t, V_t)$, with the following definition of
the projections of $x_t$ and $y_t$ into the latent space:

$$z_t = U_t^T x_t, \qquad w_t = V_t^T y_t.$$


The  online  algorithm  LSTH  consists  of  an  initialization  via  bLSTH  and  iterative 
online  updates  of  U  and  V ,  as  presented  in  Algorithm  3. 

Algorithm 2. Updating U for bLSTH
Input: orthonormal $U$, scalar $a$, cross-covariance matrix $G$,
       auto-covariance matrix $C_x$
Output: $U_{\text{updated}}$
  $k \leftarrow 0$, $U[k] \leftarrow U$
  repeat
    $k \leftarrow k + 1$
    compute SVD: $P \Sigma Q^T = G - (1 - \sigma)(C_x - aI)\, U[k-1]$    (9)
    $U[k] \leftarrow P Q^T$
  until $U[k]$ converged or $k$ is large enough
  $U_{\text{updated}} \leftarrow U[k]$
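
The update in Algorithm 2 can be sketched in a few lines of NumPy. The function
names `f_value` and `update_U`, the stopping tolerance and the returned history
are assumptions of this example; the scalar $a$ is set as in Lemma 1, and the
objective is evaluated only up to the constant that does not depend on $U$.

```python
import numpy as np

def f_value(U, C_x, G, sigma):
    """U-dependent part of f(U; sigma) in Equation (5)."""
    return 0.5 * (1.0 - sigma) * np.trace(U.T @ C_x @ U) - np.trace(G.T @ U)

def update_U(U, C_x, G, sigma, max_iter=50, tol=1e-10):
    """Majorization-minimization update of U following Algorithm 2."""
    D_x = C_x.shape[0]
    # Scalar a from Lemma 1: largest eigenvalue of C_x if sigma < 1, else 0.
    a = np.linalg.eigvalsh(C_x)[-1] if sigma < 1.0 else 0.0
    history = [f_value(U, C_x, G, sigma)]
    for _ in range(max_iter):
        A = G - (1.0 - sigma) * (C_x - a * np.eye(D_x)) @ U   # matrix in Equation (9)
        P, _, Qt = np.linalg.svd(A, full_matrices=False)      # thin SVD: A = P diag(s) Qt
        U = P @ Qt                                            # minimizer from Lemma 2
        history.append(f_value(U, C_x, G, sigma))
        if history[-2] - history[-1] < tol:                   # no further decrease
            break
    return U, history
```

The recorded objective values should never increase from one iteration to the
next, which gives a simple check that the surrogate is implemented correctly.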


Online Tracking of $U_t$. Upon arrival of new data $(x_t, y_t)$ at $t$, we use
$V_{t-1}$ to estimate the projection of $y_t$ at $t$ as follows:

$$w_t = V_{t-1}^T y_t. \tag{10}$$

Substituting $w_t$ into Equation (4), we see that the objective function of $U$
has the same form as (5), except that the historical $x_t$ are downweighted.
Therefore it can be minimized via Algorithm 2, with the only modification that
$G$ in Equation (9) is replaced by
$\sum_{k=0}^{t-1} \alpha^k x_{t-k} w_{t-k}^T$ and $C_x$ is replaced by
$\sum_{k=0}^{t-1} \alpha^k x_{t-k} x_{t-k}^T$. Both summations can be updated
incrementally.
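
The incremental form of the two weighted sums follows from factoring out one
power of $\alpha$. A minimal sketch is given below; the function name
`update_statistics` and the zero initialization used when testing it are
assumptions of this example (in practice the statistics could be initialized
from the buffered samples).

```python
import numpy as np

def update_statistics(G, C_x, x_t, w_t, alpha):
    """One-step recursion for the exponentially weighted sums.

    G_t     = alpha * G_{t-1}     + x_t w_t^T
    C_{x,t} = alpha * C_{x,t-1}   + x_t x_t^T
    """
    G = alpha * G + np.outer(x_t, w_t)
    C_x = alpha * C_x + np.outer(x_t, x_t)
    return G, C_x
```

Each update costs $O(D_x^2)$ operations and avoids storing the history of the
stream.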


Online Tracking of $V_t$. Given $U_t$ obtained as above, we use $U_t$ to
estimate $z_t$ at the current timestamp $t$ as follows:

$$z_t = U_t^T x_t. \tag{11}$$

Substituting $z_t$ into Equation (4), we obtain the following objective
function with respect to $V$:

$$\sum_{k=0}^{t-1} \frac{\alpha^k}{2} \|V^T y_{t-k} - z_{t-k}\|^2 + \lambda \|V\|_{1,2}. \tag{12}$$

For the above problem, we derive a Stochastic Coordinate Descent (SCD) method
in a similar spirit to [9]. The SCD admits a row-wise update of $V_t$; details
can be found in Equation (13) in Algorithm 3.


3.3  Complexity  Analysis 

The complexity of LSTH is $O(c \cdot D_x^2 d + D_y d)$, where
$c \cdot D_x^2 d$ is due to the SVD step in Equation (9) and $c$ is the number
of iterations in the majorization minimization for $U$ ($c = 1$ suffices in
practice). Efficient algorithms for computing the SVD of a sequentially updated
matrix [2] can be applied to reduce the complexity. The $D_y \cdot d$ term is
due to the coordinate descent algorithm on $V$, for which further acceleration
can be achieved via active set tricks. Our experiments show that LSTH is
sufficiently fast for real applications, for example, 20 ms for the XRMB
dataset (sampling interval: 25 ms/sample). The experimental details will be
presented in Section 5. Further reducing the complexity of LSTH is important
and is left for future exploration.

4  Application:  Anomaly  Detection 

The basic idea of our anomaly detection method is to monitor
$\|U^T x_t - V^T y_t\|^2 + \sigma \|(I - UU^T) x_t\|^2$. We define the a priori
error:

$$\xi_t \triangleq \|U_{t-1}^T x_t - V_{t-1}^T y_t\|^2 + \sigma \|x_t - U_{t-1} U_{t-1}^T x_t\|^2, \tag{14}$$

and use $\xi_t$ as the detection statistic. An anomaly is claimed only when
$p(x_t, y_t)$ appears to be significantly small, corresponding to $\xi_t$ being
significantly large. We maintain a sliding window over $\xi_t$, with the mean
$\mu_t$ and standard deviation $\nu_t$ within the window. When the new
$(x_{t+1}, y_{t+1})$ arrives, we compare its $\xi_{t+1}$ with a threshold
$b_t = \mu_t + \gamma \nu_t$, where $\gamma > 0$ indicates the effect of
variance. Once $\xi_{t+1}$ exceeds the threshold, an anomaly is claimed.

Algorithm 3. The Online Algorithm LSTH
Parameters: $d$, $\alpha$, $\lambda$, $\sigma$
Input: data stream $\ldots, (x_0, y_0), \ldots, (x_t, y_t), \ldots$
  Obtain $U_0$ and $V_0$ by Algorithm 1
  for $t = 1, 2, \ldots$ do
    // update $U_t$
    $w_t \leftarrow V_{t-1}^T y_t$
    obtain $U_t$ via Algorithm 2, with $G$ and $C_x$ replaced by their
    incrementally updated, downweighted sums
    // update $V_t$
    $z_t \leftarrow U_t^T x_t$
    for each row index $i$ do
      calculate the $i$-th row of $V_t$ by the row-wise SCD update (13),
      where $S(\cdot, \lambda)$ is the soft thresholding function with
      parameter $\lambda$
    end for
  end for

Additional care needs to be taken with the claimed anomalous data points.
Specifically, if the anomaly is a sudden outlier after which the data stream
returns to its normal state, then the anomalous data point should be excluded
from model updating. If, on the other hand, the anomaly is in fact the start of
a different stage in the data stream, then the anomalous data point should be
included in model updating. These two cases will be addressed in the synthetic
and real data experiments, respectively.
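
The detection rule can be sketched as follows. The class name
`SlidingThresholdDetector`, the default window length and the `keep_anomalies`
flag are assumptions of this example, and the flag only controls whether a
flagged value enters the sliding window; whether the point is also used to
update $U$ and $V$ is a separate decision left to the caller.

```python
from collections import deque
import numpy as np

def a_priori_error(U_prev, V_prev, x_t, y_t, sigma):
    """A priori error: latent discrepancy plus sigma times the projection residual."""
    latent_gap = U_prev.T @ x_t - V_prev.T @ y_t
    residual = x_t - U_prev @ (U_prev.T @ x_t)
    return float(latent_gap @ latent_gap + sigma * (residual @ residual))

class SlidingThresholdDetector:
    """Flags a new error as anomalous when it exceeds mean + gamma * std of a sliding window."""

    def __init__(self, window=100, gamma=3.0, keep_anomalies=False):
        self.errors = deque(maxlen=window)
        self.gamma = gamma
        self.keep_anomalies = keep_anomalies

    def step(self, xi):
        if len(self.errors) < 2:              # not enough history to estimate a spread
            self.errors.append(xi)
            return False
        history = np.fromiter(self.errors, dtype=float)
        threshold = history.mean() + self.gamma * history.std()
        is_anomaly = bool(xi > threshold)
        # Sudden outliers stay out of the window; regime changes may be kept in.
        if not is_anomaly or self.keep_anomalies:
            self.errors.append(xi)
        return is_anomaly
```

Sweeping `gamma` over a range of values traces out the precision-recall
trade-off of the detector.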

5  Experiments 

In this section, we conduct comparative experiments to demonstrate the
performance of LSTH in tracking the latent space for anomaly detection. All
tracking methods, together with their corresponding anomaly detection
statistics, are summarized in Table 1.

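Table 1. Latent space tracking methods and corresponding detection statistics

method | detection statistics | semantics
-------|----------------------|----------
LSTH (online) | $\xi_t = \|U_{t-1}^T x_t - V_{t-1}^T y_t\|^2 + \sigma \|x_t - U_{t-1} U_{t-1}^T x_t\|^2$ | latent discrepancy and projection residual
CCA [19] (online) | $s_t = r_{x,t}^T C_{x,t-1} r_{x,t} / D_x + r_{y,t}^T C_{y,t-1} r_{y,t} / D_y$, where $r_{x,t} = (C_{x,t-1}^{-1} - U_{t-1} U_{t-1}^T) x_t$ and $r_{y,t} = (C_{y,t-1}^{-1} - V_{t-1} V_{t-1}^T) y_t$ | projection residual onto Generalized Stiefel manifold
PCAx [4] (online) | $\epsilon_{x,t} = \|(I - U_{t-1} U_{t-1}^T) x_t\|^2$ | projection residual onto individual or joint signal subspace
PCAy [4] (online) | $\epsilon_{y,t} = \|(I - U_{t-1} U_{t-1}^T) y_t\|^2$ | projection residual onto individual or joint signal subspace
PCAxy [4] (online) | $\epsilon_{xy,t} = \|(I - U_{t-1} U_{t-1}^T) [x_t; y_t]\|^2$ | projection residual onto individual or joint signal subspace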

5.1  Experiments  on  Synthetic  Data 

We generated a synthetic dataset with continuous data
$x_t \in \mathbb{R}^{500}$ and sparse, discrete and non-negative data
$y_t \in \mathbb{R}^{1000}$. The $x_t$'s are generated via a linear model
$x_t = A\theta_t + n_t$, $t = 1, \ldots, 10500$, where
$A \in \mathbb{R}^{500 \times 10}$, $\theta_t \in \mathbb{R}^{10}$ and $n_t$ is
white Gaussian noise. The $y_t$'s are of dimension 1000. The first 50 features
of the $y_t$'s are relevant to the underlying system and are generated via
$B\theta_t + m_t$, $t = 1, \ldots, 10500$, where
$B \in \mathbb{R}^{50 \times 10}$ and $m_t$ is white Gaussian noise. The
remaining 950 dimensions are padded with noise. We introduced sparsity into
$y_t$ by randomly setting half of its values to zero. Finally, we round $y_t$
to non-negative integers. In this way, $y_t$ is analogous to real-world
documents in bag-of-words representation. In this generated dataset, we
introduced three types of anomalies, all of which are sudden outliers.
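
The generation protocol for the normal (anomaly-free) samples can be sketched
as below. Several details are not specified in the text and are assumptions of
this example: the distributions of $A$, $B$ and $\theta_t$ (standard normal
here), the noise level `noise_std`, zeroing entries independently with
probability 0.5, and rounding followed by clipping at zero. The function name
`generate_synthetic` is likewise illustrative.

```python
import numpy as np

def generate_synthetic(T=1000, D_x=500, D_y=1000, d=10, n_relevant=50,
                       noise_std=0.1, seed=0):
    """Generate normal samples; columns of X and Y are x_t and y_t."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((D_x, d))                # assumed distribution of A
    B = rng.standard_normal((n_relevant, d))         # assumed distribution of B
    theta = rng.standard_normal((d, T))              # shared latent variable theta_t
    X = A @ theta + noise_std * rng.standard_normal((D_x, T))
    Y = rng.standard_normal((D_y, T))                # noise padding for irrelevant features
    Y[:n_relevant] = B @ theta + noise_std * rng.standard_normal((n_relevant, T))
    Y[rng.random((D_y, T)) < 0.5] = 0.0              # zero out roughly half of the entries
    Y = np.maximum(np.rint(Y), 0.0)                  # round, then clip to non-negative integers
    return X, Y, A, B, theta
```

For instance, `generate_synthetic(T=10500)` would produce a stream of the
length used in the experiments, into which the three anomaly types can then be
injected at the chosen timestamps.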

Table 2. Synthetic dataset: AUC and parameters

method | Type-1 AUC | Type-1 parameters | Type-2 AUC | Type-2 parameters | Type-3 AUC | Type-3 parameters
-------|------------|-------------------|------------|-------------------|------------|------------------
LSTH | 0.863/0.860 | 10,10,1,10 | 0.995/0.993 | 10,10,1,10 | 0.984/0.979 | 10,20,1,0
CCA | 0.848/0.859 | 10 | 0.020/0.019 | 10 | 0.971/0.950 | 500
PCAx | 0.500/0.525 | 10, 1 | 0.015/0.018 | 10, 1 | 0.013/0.016 | 10, 1
PCAxy | 0.644/0.662 | 20, 1 | 0.744/0.730 | 20, 1 | 0.977/0.971 | 20, 1
PCAy | 0.298/0.365 | 10, 1 | 0.015/0.015 | 10, 1 | 0.977/0.960 | 20, 1

The parameters for LSTH are $d$ (dimension of the latent space), $\lambda$,
$\alpha$ and $\sigma$, respectively. The parameter for CCA is $d$. The
parameters for PCAx, PCAxy and PCAy are $d$ and the forgetting factor,
respectively. AUC of the precision-recall plot is used for evaluation; the
larger the AUC value, the better the performance. The values under each AUC
column (i.e., x/y) are the performance on the training and testing sets,
respectively. Bold numbers correspond to the best performance for each anomaly
type among all the methods.


Type-1 anomaly: at $t = 500, 600, \ldots, 10400$, $x_t$ is distorted to
$\tilde{x}_t = \tilde{A}\theta_t + n_t$, where $\tilde{A}$ is identical to $A$
except that one row is randomly re-drawn from $\mathcal{N}(0, 1)$. At the same
timestamps, $B$ in generating $y_t$ is also distorted to $\tilde{B}$ by
randomly re-drawing 5 of its rows from $\mathcal{N}(1, 0.3^2)$. This
corresponds to the scenario in which both $x_t$ and $y_t$ behave anomalously at
the same time.

Type-2 anomaly: at $t = 500, 600, \ldots, 10400$, only $x_t$ is distorted to
$\tilde{x}_t = A\tilde{\theta}_t + n_t$ with
$\tilde{\theta}_t \sim \mathcal{N}(3.5, 1)$, that is, the latent variable
$\theta_t$ is distorted. In this way, a discrepancy is introduced between the
latent variables of $x_t$ and $y_t$. This corresponds to the scenario in which
$x_t$ has anomalies but $y_t$ behaves normally.

Type-3 anomaly: at $t = 500, 600, \ldots, 10400$, three relevant features and
three of the remaining 950 features of $y_t$ are exchanged. This corresponds to
the scenario in which some relevant features in $y_t$ are changed while $x_t$
remains normal.


Experimental Results on Synthetic Data. We compare all methods in Table 1 on
the anomaly detection task. For all the methods, the first 100 samples are used
for initialization. The $\gamma$ in computing the detection threshold is varied
to produce a full precision-recall plot. The parameters are selected as the
ones that maximize the Area Under Curve (AUC) of the precision-recall plot on a
training set generated separately with the same data generation protocol.
Results are presented in Table 2. For the three types of anomalies, LSTH
consistently achieves the best detection performance. CCA is competitive for
Type-1 and Type-3 anomalies but completely fails for Type-2, because its
detection statistic cannot capture changes in the signal/latent space. PCAx
fails on Type-2 for the same reason. On average, PCA-based methods perform
worst among all the methods except for Type-3. However, by joining the two data
sources properly, PCAxy is able to detect the change of the "joint" subspace
and thereby achieves better performance than PCAx and PCAy.

Fig. 1. Feature selection effects of LSTH (horizontal axis: V row index)

Table 3. XRMB results

method | AUC | parameters
-------|-----|-----------
LSTH | 0.342 | 20,300,0.95,1000
CCA | 0.045 | 30
PCAx | 0.035 | 20, 1e-5
PCAxy | 0.035 | 30, 1e-5
PCAy | 0.033 | 30, 1e-5

The parameters listed for each method are the same as those in Table 2.

Figure 1 shows the norm of each row of the learned $V$ after all the updates of
LSTH at $t = 10500$. For the relevant features in $y_t$ ($i = 1, \ldots, 50$),
the $\|v_i^T\|$ are non-zero. For the irrelevant features ($i > 50$), the
$\|v_i^T\|$ are zero or very small. This demonstrates that LSTH can
successfully identify the relevant features via the mixed norm on $V$.


5.2  Experimental  Results  on  Real  Data:  XRMB 

XRMB [16] contains synchronous 273-dim MFCC and 112-dim articulatory
information of length 51K. Each timestamp has a label indicating which word it
corresponds to. Details on the data are available in [1]. Speech segmentation
has attracted much attention for treating related diseases [15]. The task in
our experiment is to detect the boundaries of words from acoustic and
articulatory features. During each segment, a tracking algorithm, e.g., LSTH,
gradually learns the underlying latent subspace. Upon arrival of a new segment,
the underlying latent space changes suddenly. This event may induce a drastic
change in the detection statistics provided by the tracking algorithms, and is
therefore considered an anomaly. In this case, the claimed anomalous data point
should be incorporated in learning the new latent space of the new segment.

When applying LSTH, we assign to $x_t$ the articulatory features, whose
dimensions are highly correlated [3]. We designate $y_t$ as the MFCC, which is
redundant and for which sparse filtering has been shown to be necessary for
feature selection [11]. We randomly select 1000 frames for parameter tuning for
all the methods, and use the tuned parameters for testing on the rest of the
frames. Figure 2 shows the detection statistics of all methods on the parameter
tuning dataset. Out of 25 words within the 1000 frames, LSTH is able to
identify 15 words with clear and strong spikes in the detection statistics.
After each anomaly alarm (start of a new segment), it quickly adapts to the new
latent space of the new segment. PCA-based methods show only weak spikes. CCA
fails in this case, consistent with the conclusion in [1]. Based on the results
in [1], kernel CCA should be a better approach on this dataset than CCA.
However, there is no meaningful detection statistic for kernel CCA, so we leave
this approach for later research.

We then applied all the methods to the rest of the data with their optimal
parameters tuned on the training set. The parameters and the performance of the
different methods are presented in Table 3. LSTH has an AUC value of 0.342
(note that a random guess would give an AUC of 412/51000 = 0.008), and it is
the only method that can detect the boundaries of the words in the XRMB
dataset. All the other methods fail, with AUC values smaller than 0.05.

Fig.  2.  Detection  statistics  on  XRMB  training  data


6  Conclusions  and  Discussions 

We developed LSTH, a latent space tracking method for heterogeneous streaming
data. Under the assumption that anomalies deviate significantly from the latent
space, we further designed an anomaly detection method based on LSTH.
Experimental results demonstrate that LSTH's detection statistics outperform
those of the other state-of-the-art methods in identifying anomalies. This
suggests that LSTH characterizes the latent structure of heterogeneous data
better than the other methods do. Future work on LSTH includes non-linear
mapping into the latent space via kernelization, online supervised learning in
the latent space, and extension to cases with more than two views of a system.